Visual Detective Assignment Part 2

Visual Detective R Assignment

This assignment attempts to solve the 2021 IEEE Visual Analytics Science and Technology (VAST) Challenge: Mini-Challenge 2 by applying different visual analytics concepts, methods, and techniques with relevant R data visualisation and data analysis packages.

Archie Dolit https://www.linkedin.com/in/adolit/ (School of Computing and Information Systems, Singapore Management University)
07-25-2021

4. Proposed Solutions

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

Generate an interactive bar graph in descending order using ggplot and plotly to determine the most popular locations.

popular_combine <- cc_loyalty_data %>%
  group_by(location) %>%
  summarize(total_count=n()) %>%
  ggplot(aes(x=reorder(location, total_count),
             y=total_count,
             text = paste("Location :", location,"\n",
                          "Number of transactions:", total_count))) +
  geom_bar(stat="identity", fill = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions") +
  labs(x = "Locations", y = "Transaction Count") + 
  coord_flip() +
  theme_minimal()

ggplotly(popular_combine, tooltip = "text")

Based on the combined combined credit card and loyalty data, the most popular location is Katerina’s Cafe with a total of 256 transactions, followed by Hippokampos with 213 transactions and Guy’s Gyro with 187 transactions.

Generate an interactive heatmap using ggplot and plotly to determine the date and time when employees visit the locations.

day_location_count <- cc_loyalty_data %>%
  count(location, day) %>%
  rename(count = n)

popular_day_location <- ggplot(data = day_location_count,
                               aes(x=day, y=reorder(location, desc(location)),
                                   fill = count,
                                   text = paste("Location :", location,"\n",
                                                "Day of week:", day,"\n",
                                                "Number of transactions :", count))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions by Day") +
  labs(x = "Day of the Week",y = "Locations") + 
  theme_minimal()

ggplotly(popular_day_location, tooltip = "text")

Based on the combined combined credit card and loyalty data, Brew’ve Been Served is popular on weekdays, Monday to Friday, with no transactions on weekend. Probably this location is only open weekday.

Guy’s Gyro, Hippokampos, and Katerina’s Cafe are very popular throughout the week, Sunday to Monday. Katerina’s Cafe is the most popular location on Saturday with a total of 42 transactions.

Some of the interesting transactions are U-Pump with 2 transactions only on Monday and Desafio Golf Course with only 9 transactions only on Sunday.

hour_location_count <- cc_loyalty_data %>%
  count(location, hour) %>%
  rename(count = n)
  
popular_hour_location <- ggplot(data = hour_location_count,
                               aes(x=hour, y=reorder(location, desc(location)),
                                   fill = count,
                                   text = paste("Location :", location,"\n",
                                                "Hour of the Day:", hour,"\n",
                                                "Number of transactions :", count))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions by Hour of Day") +
  labs(x = "Hour of the Day",y = "Locations") + 
  theme_minimal()

ggplotly(popular_hour_location, tooltip = "text")

Based on the time of transaction, Brew’ve Been Served and Hallowed Grounds are popular in the morning around 7AM and 8AM. Most probably the employees visit the place before heading to the office.

Abila Zacharo, Bean There Done That, Brewed Awakenings, Gelatogalore, Guy’s Gyro, Hippokampos, Jack’s Magical Beans, Kalami Kafenion, Katerina’s Kafe, and Ouzera Elian are popular during lunch break around 12NN to 1PM.

Guy’s Gyro, Hippokampos, and Katerina’s Kafe are popular during dinner around 7PM and 8PM. Katerina’s Kafe has the highest transactions at 85 purchases at around 8PM.

Generate an interactive boxplot using plotly to determine the outliers and provide clues on some anomalies.

outlier <- plot_ly(data = cc_loyalty_data,
                   x = ~price,
                   color = I("royalblue4"),
                   alpha = 0.5,
                   boxpoints = "suspectedoutliers") %>%
  add_boxplot(y = ~reorder(location, desc(location))) %>%
  layout(title = "Combined Credit Card & Loyalty Transactions Outliers",
         yaxis = list(title = "Locations"),
         xaxis = list(title = "Price"))

outlier

Based on the price of transaction, it seems that there is unusual expensive purchase at Frydos Autosupply n More amount to 10,000. This is highly suspicious since the mean price for this location is only 161.96 with third quartile value of approximately 250.

Generate an interactive linegraph using plot_anomaly_diagnostics() of plotly to diagnose anomalous points in the cc_data purchase prices. Note that only locations with sufficient number of observations were selected for the anomaly diagnostics.

cc_data %>%
  filter(location %in% c("Abila Airport",
                         "Albert's Fine Clothing",
                         "Carlyle Chemical Inc.",
                         "Chostus Hotel",
                         "Frydos Autosupply n' More",
                         "Gelatogalore",
                         "Nationwide Refinery",
                         "Stewart and Sons Fabrication")) %>%
  group_by(location) %>%
  plot_anomaly_diagnostics(timestamp, price, 
                           .facet_ncol = 2,
                           .y_lab = "Price")

Based on the anomaly diagnostics, there are unusual purchases in Gelatogalore, Frydos Autosupply n’ More, Albert’s Fine Clothing, and Chostus Hotel. Again, the most expensive purchase is from Frydos Autosupply n More amounting to 10,000 on 2014-01-13 19:20:00.

The anomalies will not be removed or corrected. It will be kept in the data since it may lead to more clues in solving the challenge.

Q2: Anomalies in Vehicle, Credit Card and Loyalty Card Data

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Generate an interactive heatmap using ggplot and plotly based on the amount of transactions with missing last4ccnum.

missing_last4ccnum <- cc_loyalty_data %>%
   filter(is.na(last4ccnum)) 

na_last4ccnum  <- ggplot(data = missing_last4ccnum,
                         aes(x=date, y=reorder(location, desc(location)),
                                   fill = price,
                                   text = paste("Location :", location,"\n",
                                                "Date:", date,"\n",
                                                "Total Amount of Transaction:", price))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Transactions with Missing Credit Card Data by Date") +
  labs(x = "Date of Transaction", y = "Locations") +
  theme_minimal()

ggplotly(na_last4ccnum, tooltip = "text")

Based on the total amount of transactions with missing credit card, National Refinery has a transaction on 2014-01-08 with a price of 4367.63. Stewart and Sons Fabrication has a transaction on 2014-01-13 with a price of 4071.95 and another one on 2014-01-15 with a price of 4485.38.

The discrepancies may be due to employees who bought the items with cash instead of credit card but still used the loyalty card to redeem points or rewards.

Generate another interactive heatmap using ggplot and plotly based on the amount of transactions with missing loyaltynum.

missing_loyaltynum <- cc_loyalty_data %>%
   filter(is.na(loyaltynum))

na_loyaltynum  <- ggplot(data = missing_loyaltynum,
                         aes(x=date, y=reorder(location, desc(location)),
                                   fill = price,
                                   text = paste("Location :", location,"\n",
                                                "Date:", timestamp,"\n",
                                                "Total Amount of Transaction:", price))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Transactions with Missing Loyalty Data by Date") +
  labs(x = "Date of Transaction", y = "Locations") +
  theme_minimal()

ggplotly(na_loyaltynum, tooltip = "text")

Based on the total amount of transactions with missing loyalty card, Frydos Autosupply n More has a transaction on 2014-01-13 19:20:00 with a price of 10,000.

The discrepancy is more suspicious since the person who bought the items did not use his loyalty card which may imply possible misuse of the credit card when making the transaction.

Add the gps and car data by creating a movement path from GPS points using the CarIDs as unique identifier. Filter the data around the time of transaction from 2014-01-13 19:00 to 21:00.

gps_path_0113 <- car_gps_sf %>%
  filter(timestamp >= "2014-01-13 19:00" & timestamp <= "2014-01-13 21:00") %>%
  group_by(CarID, date) %>%
  summarize(m = mean(timestamp), 
            do_union=FALSE) %>%
  st_cast("LINESTRING") 

Plot the gps path on the background tourist map and identity which CarIDs are within the vicinity of Frydos Autosupply n More.

gps_path_selected_0113 <- gps_path_0113 %>%
  filter(CarID %in% c("13" , "15", "16", "34"))

tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_path_selected_0113) +
  tm_lines() +
  tm_facets(by = "CarID",  ncol = 1)

From the gps paths, CarID “13” , “15”, “16”, “34” are within the vicinity of Frydos Autosupply n More during suspicions transaction amounting to the price of 10,000.

Create an interactive data table based on the joint gps and car and filter the date to 2014-01-13.

car_gps_0113 <- car_gps_data %>%
  filter(timestamp >= "2014-01-13 19:00" & timestamp <= "2014-01-13 21:00") %>%
  filter(CarID %in% c("13" , "15", "16", "34")) %>%
  group_by(CarID, Deparment, Title, FullName) %>%
  summarise()

DT::datatable(car_gps_0113)

From the interactive table, all CarIDs identified are from the Security Department. Possibly, Isia Vann and Edvard Vann are relatives because of the same Last Name and working together as Perimeter Controller.

Q3: Owners of Credit Card and Loyalty Card

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

It is challenging to infer the owners of the credit card and loyalty card since there is no data field to join the credit card and gps data.

One possible approach to plot the gps path and check it against timestamp of the credit card transactions. This approach assumes that the person driving the car is the same person making the credit card transaction. Additionally, it assumes that the gps coordinates, timestamp as well as the credit location and timestamp are accurate.

To implement this proposed approach, it assumes the geospatial tracking software installed in the employees car will stop tracking if the vehicle is not moving. Borrowing from concept of ‘Point of Interest’ (POI) from Virginia Tech, POI is identified if the employee stops for more than 5 minutes.

Identify the POIs by computing the difference of gps timestamp. If the difference is greater than 5 minutes, it will be set to poi = TRUE.

gps_poi_sf <- car_gps_sf %>%
  group_by(CarID) %>%
  mutate(diff = timestamp - lag(timestamp, order_by=CarID)) %>%
  mutate(poi = if_else(diff > 60*5,  TRUE, FALSE)) %>%
  filter(poi == TRUE) %>%
  ungroup() 

glimpse(gps_poi_sf)
Rows: 3,067
Columns: 11
$ timestamp <dttm> 2014-01-06 06:53:01, 2014-01-06 07:05:01, 2014-01~
$ CarID     <fct> 4, 35, 4, 10, 34, 26, 20, 19, 18, 12, 32, 33, 3, 7~
$ date      <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day       <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour      <int> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,~
$ geometry  <POINT [°]> POINT (24.86419 36.07333), POINT (24.87336 3~
$ Deparment <chr> "Executive", "Executive", "Executive", "Executive"~
$ Title     <chr> "SVP/CFO", "Environmental Safety Advisor", "SVP/CF~
$ FullName  <chr> "Ingrid Barranco", "Willem Vasco-Pais", "Ingrid Ba~
$ diff      <drtn> 1057 secs, 1920 secs, 1320 secs, 1980 secs, 1633 ~
$ poi       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR~

Plot the POIs on the backgroup tourist map and identity the employee locations. From the Combined Credit Card & Loyalty Transactions by Hour of Day heatmap of Question 1, the earliest transaction is around 3:00 AM from Kronos Mart while the last transaction is 10:00 PM from Hippokampos. This information can be used to limit the number of POIs.

gps_poi_points <- gps_poi_sf %>%
  filter(hour >= 3 & hour <= 23) %>%
  select(timestamp,
         CarID,
         Deparment,
         Title,
         FullName)

tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_poi_points) +
  tm_dots(col = 'red', border.col = 'black', size = 1, alpha = 0.5, jitter = .8) +
  tm_facets(by = "FullName", ncol = 1)

Create an interative data table based on the joint credit card and infer the owner based on the POI plot.

cc_owner <- cc_data %>%
  select(timestamp, location, last4ccnum)

DT::datatable(cc_owner)

Finally, manually map the credit card transaction purchases timestamp against the POI map. This approach is very time consuming since it entails manual effort. As for most cases, the POI map may show certain points around the vicinity of the location but does not have any corresponding credit card purchases.

The owner of the loyalty card will be known after the credit card owner has been identified. As noted earlier, there is no one-to-one correspondence between the credit and loyalty card.

The heatmap below visualizes the number of transactions between specific credit card and loyalty card to know the highest likelihood of the loyalty card owner.

cc_loyal_count <- cc_loyalty_data %>%
  group_by(last4ccnum,loyaltynum) %>%
  summarise(count=n())
  
cc_loyal_correlate <- ggplot(data = cc_loyal_count,
                               aes(x=loyaltynum, y=as.factor(last4ccnum),
                                   fill = count,
                                   text = paste("Last 4 Credit Card Number :", last4ccnum,"\n",
                                                "Loyalty Card Number:", loyaltynum,"\n",
                                                "Number of transactions :", count))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Correlation of Credit Card & Loyalty Card by Number of Transactions") +
  labs(x = "Loyalty Card Number",y = "Credit Card Number") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90))

ggplotly(cc_loyal_correlate, tooltip = "text")

From the heatmap, it can seen that certain combination of credit card number and loyalty card are most often used together. Example of which are 6901 and L9363 with 28 transactions; 7117 and L6417 also with 28 transactions.

Nevertheless, even for credit card 6901, there are still 9 transactions when loyalty card was not used. Similarly, credit card 7117 has 3 transactions when loyalty card was not used.

Click HERE to view the Visual Detective Assignment Part 3.

Citation

For attribution, please cite this work as

Dolit (2021, July 25). Visual Analytics & Applications: Visual Detective Assignment Part 2. Retrieved from https://adolit-vaa.netlify.app/posts/2021-07-26-assignment-2/

BibTeX citation

@misc{dolit2021visual,
  author = {Dolit, Archie},
  title = {Visual Analytics & Applications: Visual Detective Assignment Part 2},
  url = {https://adolit-vaa.netlify.app/posts/2021-07-26-assignment-2/},
  year = {2021}
}